3 Data manipulation: dplyr

First, load the library:

library(tidyverse)

3.1 Data

In this chapter we will use the following datasets.

3.1.1 Misspelling dataset

This dataset I gathered after some manipulations with data from The Gyllenhaal Experiment By Russell Goldenberg and Matt Daniels for pudding. They analysed mistakes in spellings of celebrities during the search.

misspellings <- read_csv("https://raw.githubusercontent.com/agricolamz/2020.02_Naumburg_R/master/data/misspelling_dataset.csv")
## Parsed with column specification:
## cols(
##   correct = col_character(),
##   spelling = col_character(),
##   count = col_double()
## )
misspellings

There are the following variables in this dataset:

  • correct — correct spelling
  • spelling — user’s spelling
  • count — number of cases of user’s spelling

3.1.2 diamonds

diamonds — is the dataset built-in in tidyverse package.

diamonds
?diamonds

3.2 dplyr

Here and here is a cheatsheet on dplyr.

3.2.1 filter()

This function filter rows by some condition.

How many wrong spellings that were used by less then 10 users?

misspellings %>%
  filter(count < 10)

%>% it is pipe. It allow to chain operations, puting the output of one function into the input of another:

sort(sqrt(abs(sin(1:22))), decreasing = TRUE)
##  [1] 0.9999951 0.9952926 0.9946649 0.9805088 0.9792468 0.9554817 0.9535709
##  [8] 0.9173173 0.9146888 0.8699440 0.8665952 0.8105471 0.8064043 0.7375779
## [15] 0.7325114 0.6482029 0.6419646 0.5365662 0.5285977 0.3871398 0.3756594
## [22] 0.0940814
1:22 %>% 
  sin() %>% 
  abs() %>% 
  sqrt() %>% 
  sort(., decreasing = TRUE) # why do we need a dot here?
##  [1] 0.9999951 0.9952926 0.9946649 0.9805088 0.9792468 0.9554817 0.9535709
##  [8] 0.9173173 0.9146888 0.8699440 0.8665952 0.8105471 0.8064043 0.7375779
## [15] 0.7325114 0.6482029 0.6419646 0.5365662 0.5285977 0.3871398 0.3756594
## [22] 0.0940814

Pipes that are used in tidyverse are from the package magrittr. Sometimes pipe could work not well with functions outside the tidyverse.

3.2.2 slice()

This function filter rows by its index.

misspellings %>%
  slice(3:7)

3.2.3 select()

This functions for choosing variables from dataframe.

diamonds %>%
  select(8:10)
diamonds %>%
  select(color:price)
diamonds %>%
  select(-carat)
diamonds %>%
  select(-c(carat, cut, x, y, z))
diamonds %>%
  select(cut, depth, price)

3.2.4 arrange()

This function order rows in dataframe (numbers — by order, strings — alphabeticly).

misspellings %>%
  arrange(count)
diamonds %>%
  arrange(desc(carat), price)
diamonds %>%
  arrange(-carat, price)

3.2.5 distinct()

This function retern only unique rows from an input dataframe.

misspellings %>%
  distinct(correct)
misspellings %>%
  distinct(spelling)
diamonds %>%
  distinct(color, cut)

In built-in dataset starwars filter those characters that are higher then 180 (height) and weigh less then 80 (mass). Then get a unique names of their homeworlds (homeworld).

3.2.6 mutate()

This functions creates new variables.

misspellings %>%
  mutate(misspelling_length = nchar(spelling),
         id = 1:n())
Create a variable with body mass index индексом Кетле: \(\frac{mass}{height^2}\) for all characters from starwars dataset. How many charachters have obesity (have body mass index greater 30)? (Don’t forget to convert height from centimetres to metres).

3.2.7 group_by(...) %>% summarise(...)

This function allows to group variables by some columns adn get some discriptive statistics (maximum, minimum, last value, first value, mean, median etc.)

misspellings %>%
  summarise(min(count), mean(count))
misspellings %>%
  group_by(correct) %>% 
  summarise(mean(count))
misspellings %>%
  group_by(correct) %>% 
  summarise(my_mean = mean(count))

If you need to calculate number of cases, use the function n() in summarise() or the count() function:

misspellings %>%
  group_by(correct) %>% 
  summarise(n = n())
misspellings %>%
  count(correct)

It is even possible to srot the result, using sort argument:

misspellings %>%
  count(correct, sort = TRUE)

In case you don’t want to have any summary, but an additional column, just replace summarise() with mutate()

misspellings %>%
  group_by(correct) %>% 
  mutate(my_mean = mean(count))

Here is a scheme:

In the starwars dataset create a variable that contain mean height value for each species.

3.3 Merging dataframes

3.3.1 bind_...

This is a family of functions that make it possible to merge dataframes together:

my_tbl <- tibble(a  = c(1, 5, 2), 
                 b = c("e", "g", "s"))

Here is how to merge two datasets by row:

my_tbl %>% 
  bind_rows(my_tbl)

In case there is an absent column, values will be filled with NA:

my_tbl %>% 
  bind_rows(my_tbl[,-1])

In order to merge dataframes by column you need another function:

my_tbl %>% 
  bind_cols(my_tbl)

In case there is an absent row, this function will return an error:

my_tbl %>% 
  bind_cols(my_tbl[-1,])
## Error: Argument 2 must be length 3, not 2

3.3.2 .._join()

These functions allow to merge different datasets by some column (or columns in common).

languages <- data_frame(
  languages = c("Selkup", "French", "Chukchi", "Polish"),
  countries = c("Russia", "France", "Russia", "Poland"),
  iso = c("sel", "fra", "ckt", "pol")
  )
## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.
languages
country_population <- data_frame(
  countries = c("Russia", "Poland", "Finland"),
  population_mln = c(143, 38, 5))
country_population
inner_join(languages, country_population)
## Joining, by = "countries"
left_join(languages, country_population)
## Joining, by = "countries"
right_join(languages, country_population)
## Joining, by = "countries"
anti_join(languages, country_population)
## Joining, by = "countries"
anti_join(country_population, languages)
## Joining, by = "countries"
full_join(country_population, languages)
## Joining, by = "countries"

3.4 tidyr package

Here is a dataset with number of speakers of some language of India according census 2001 (data from Wikipedia):

langs_in_india_short <- read_csv("https://raw.githubusercontent.com/agricolamz/2020.02_Naumburg_R/master/data/languages_in_india.csv")
## Parsed with column specification:
## cols(
##   language = col_character(),
##   n_L1_sp = col_double(),
##   n_L2_sp = col_double(),
##   n_L3_sp = col_double(),
##   n_all_sp = col_double()
## )
  • Short format
langs_in_india_short
  • Long format
  • Short format → Long format: tidyr::pivot_longer()
langs_in_india_short %>% 
  pivot_longer(names_to = "type", values_to = "n_speakers", n_L1_sp:n_all_sp)->
  langs_in_india_long

langs_in_india_long
  • Long format → Short format: tidyr::pivot_wider()
langs_in_india_long %>% 
  pivot_wider(names_from = "type", values_from = "n_speakers")->
  langs_in_india_short
langs_in_india_short

Here is data, that contain information about villages of Daghestan in .xlsx format. Data separated by different sheets and contain the following variables (data obtained from different sources, so they have suffixes _s1 – first source and _s2 – second source):

  • id_s1 – (s1) identification number from first source;
  • name_1885 – (s1) name of the village according the 1885 census
  • census_1885 – (s1) population according the 1885 census
  • name_1895 – (s1) name of the village according the 1895 census
  • census_1895 – (s1) population according the 1895 census
  • name_1926 – (s1) name of the village according the 1926 census
  • census_1926 – (s1) population according the 1926 census
  • name_2010 – (s1) name of the village according the 2010 census
  • census_2010 – (s1) population according the 2010 census
  • language_s1 – (s1) language name according the first source
  • name_s2 – (s2) village name according the second source
  • language_s2 – (s2) language name according the second source
  • Lat – (s2) latitude
  • Lon – (s2) longitude
  • elevation – (s2) altitude

First, merge all sheets fromt the .xlsx file:

Second, caclulate how many times language name is the same in both sources.

Third, calculate mena altitude for languages from the first source. Which is the highest?

Fourth, calculate population for languages from the second source in each census. Show the values obtained for the Lak language: